Sometimes cultures collide. Microprocessor design has always emphasized clock frequency, hand design and careful tuning at the circuit level. SoC design emphasizes floor planning, synthesized blocks and careful attention to global routing. But as a recent design from PMC-Sierra illustrates, those two worlds are being forcibly joined by the growing needs of the network-processing market.
The processor group at PMC, formerly independent as Quantum Effect Designs, was charged with developing a dual-processor engine for networking applications. Each of the two MIPS CPUs would have to run at 1 GHz -- not breathtaking by Intel standards, but well beyond the envelope for most embedded-processor designs.
The added complexity came from another issue: bandwidth. To effectively link the two CPUs to a networking system, the chip needed a lot more than just an external bus. The two processors, each of which included L1 and L2 caches, would need a direct channel between them for interprocessor communications and cache coherency transactions. And the link among CPUs, memory and high-speed external buses would require something much more like a switch fabric than like a microprocessor bus. In fact, the team implemented a five-port switch with internal storage for this purpose.
The chip was to be implemented in TSMC's 0.13LV process, with 1.2-V core supply voltage. The combination of very demanding CPU clock rate and complex, block-oriented overall chip architecture made the design of the x2 a hybrid -- demanding QED's deep experience in fast CPU design and techniques from the emerging world of deep-submicron SoC design. The result was a fascinating mix of design techniques rolled into a coherent methodology.
The clock frequency target, while technically demanding, was perhaps the most familiar challenge for the PMC design team. With 15 CPUs under its collective belt, the team had the right tools. But this was not going to be an instantiation of anything they'd done in the past. QED's processors had all used a five-stage pipeline. Analysis showed that to meet the clock target, the team would now have to use a seven-stage pipe, somewhat complicating the design, especially in the control logic. In addition, circuit design techniques would have to be updated. PMC enlarged its custom cell library by a factor of three for the new design, adding, for example, domino logic in place of static logic cells for some circuits.
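The article doesn't give the team's timing numbers, but the reasoning behind moving from five to seven stages can be sketched with a toy model: the critical-path logic delay gets divided across more stages, while each stage pays a fixed register overhead. Every figure below is an illustrative assumption, not data from the design.

```python
# Toy model of pipeline depth vs. achievable clock (all numbers assumed).
# Assumes a fixed critical-path logic delay split evenly across stages,
# plus a fixed per-stage flop/setup overhead.
LOGIC_NS = 5.6      # assumed total logic delay on the critical path
OVERHEAD_NS = 0.15  # assumed per-stage register overhead

def max_clock_ghz(stages):
    """Best-case clock frequency for an evenly balanced pipeline."""
    period_ns = LOGIC_NS / stages + OVERHEAD_NS
    return 1.0 / period_ns

print(round(max_clock_ghz(5), 2))  # a five-stage pipe falls short of 1 GHz
print(round(max_clock_ghz(7), 2))  # a seven-stage pipe clears it
```

With these assumed numbers, the five-stage pipe tops out below 1 GHz while the seven-stage pipe clears it, at the cost of the extra control-logic complexity the article mentions.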
The CPU design was otherwise conducted along pretty traditional lines. Rather than trusting critical paths to synthesis, the majority of the CPU was created at the schematic level, and then implemented by a combination of hand design and data path compilation. One significant departure was the use of extensive, detailed floor planning with a proprietary extraction tool. The designers floor planned early in the process, down to the level of assigning pins to particular locations on the perimeters of the blocks. Then the proprietary route-estimator and extractor tool provided parasitic estimates for early timing analysis. This kept timing closure under control without having to do the CPUs as single flat designs.
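The article doesn't describe how PMC's proprietary estimator works, but the general idea of deriving early parasitics from floor-plan pin positions can be sketched as follows. The Manhattan-distance routing assumption, the per-mm R and C figures, and the lumped-RC delay formula are all illustrative stand-ins, not details of the actual tool.

```python
# Sketch of floor-plan-driven parasitic estimation (illustrative only).
# Assumes pins are already placed on block perimeters, routes follow
# Manhattan distance, and uses made-up per-mm wire R/C values.

R_PER_MM = 250.0    # ohms per mm of wire (assumed)
C_PER_MM = 0.2e-12  # farads per mm of wire (assumed)

def manhattan_mm(pin_a, pin_b):
    """Estimated route length between two placed pins, in mm."""
    (xa, ya), (xb, yb) = pin_a, pin_b
    return abs(xa - xb) + abs(ya - yb)

def estimated_wire_delay(pin_a, pin_b):
    """Lumped-RC (0.5 * R * C) delay estimate for the route, in seconds."""
    length = manhattan_mm(pin_a, pin_b)
    return 0.5 * (R_PER_MM * length) * (C_PER_MM * length)

# Example: two block pins 3 mm apart (Manhattan) on the floor plan.
delay = estimated_wire_delay((0.0, 0.0), (2.0, 1.0))
print(f"{delay * 1e12:.1f} ps")  # estimate fed into early timing analysis
```

Feeding estimates like this into static timing analysis before routing is what lets timing closure proceed block by block, rather than flattening the whole CPU.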
The rest of the system
But in this design, two CPUs were just the core of the chip. There was still significant work to be done on the inter-CPU connection, the high-speed I/O and the so-called shared-memory fabric. In contrast to the CPU cores, the design of these portions of the chip followed a relatively standard ASIC-like flow. One key block was the interconnect between the two CPUs. Logically a buffered crossbar switch, but implemented with multiplexers, this block had to provide an 8-Gbyte/second connection between the two CPU cores and a 4-Gbyte/s path to the shared-memory fabric. The other key was the shared memory itself. Logically, this was a 20-Gbyte/s, five-port switch linking the inter-CPU bus, the main memory bus and three different off-chip buses. Physically, it was implemented as a 500-MHz, five-port SRAM array. After the design of the five-port RAM cell, the array was relatively conventional.
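The quoted fabric bandwidth is consistent with a simple ports-times-clock-times-width calculation. The 8-byte (64-bit) port width below is an assumption chosen to make the arithmetic work out; the article gives only the clock rate, port count and aggregate figure.

```python
# Back-of-envelope check on the shared-memory fabric bandwidth.
PORTS = 5                 # five-port SRAM array
CLOCK_HZ = 500e6          # 500-MHz array clock
PORT_WIDTH_BYTES = 8      # assumed 64-bit ports (not stated in the article)

aggregate_gbytes_per_s = PORTS * CLOCK_HZ * PORT_WIDTH_BYTES / 1e9
print(aggregate_gbytes_per_s)  # 20.0 -> matches the quoted 20-Gbyte/s figure
```

The same assumed 8-byte width at the CPUs' clock would put a single 1-GHz port at 8 Gbytes/s, in line with the inter-CPU link figure, though that correspondence is conjecture.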
If the design was a straightforward blend of two cultures, the verification problem was just plain huge. The team quickly found that the chip required 10 times the simulation time of any previous design in its experience. And although software simulation was the easiest to bring up and the most flexible, it proved intractable as the design moved to the gate level. For that phase the team turned to another tool from the CPU-design culture, hardware emulation, teamed with commercial verification languages.
Overall, the project represents an interesting blend of the hand-design school of CPU craftsmanship and the hierarchical synthesis approach to modern SoC design. The main lesson is that, in a properly planned methodology, the two approaches can in fact coexist.